Antonio Debouse, Bodie Franklin, Eric Romero
Credit card companies are always in search of better ways to monitor borrowers and determine whether a borrower will default on their credit card payments or make them in full. Defaulted credit card payments are often difficult to recoup and create losses for these companies. Defaulting on a payment is defined as failing to meet the debt obligation (here, the credit card payment). Our dataset is composed of 24 attributes and 30,000 records that reflect Taiwanese credit card borrowers' payment histories over a six-month period. The data was pulled from the UCI Machine Learning Repository. The purpose of the dataset is to provide attributes at different points in the payment history to identify whether a credit card borrower will default on their payments or pay in full. Since the dataset captures six payment periods, it gives the credit card firm a chance to identify whether default will occur in various billing cycles.
An effective classification algorithm is one that produces strong accuracy, sensitivity, and specificity scores under cross-validation. If an effective classification model can be built, the credit company will have the ability to proactively monitor borrowers at various credit stages.
We will be using clustering methods to see if we can improve the accuracy of our best model from Lab 2, the Random Forest model. To measure the effectiveness of each clustering method we will look at the silhouette score, which compares how close each point sits to its own cluster versus the nearest neighboring cluster. Silhouette values near 1 indicate more distinct clusters, which should improve the base model; low silhouette scores suggest that the identified clusters overlap with each other and will do little to enhance the model.
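As a toy illustration of how the silhouette score behaves (a minimal sketch on made-up points, not our credit data):

```python
import numpy as np
from sklearn.metrics import silhouette_score

# two well-separated groups -> silhouette near 1
points = np.array([[0, 0], [0, 1], [10, 10], [10, 11]])
score_sep = silhouette_score(points, [0, 0, 1, 1])

# the same points with labels that mix the groups -> a negative score,
# signalling that the "clusters" overlap badly
score_mix = silhouette_score(points, [0, 1, 0, 1])

print(round(score_sep, 3), round(score_mix, 3))
```

The first labeling scores close to 1 while the second is negative, which is the behavior we rely on when ranking cluster counts below.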
We chose a repeated cross-validation scheme (ten shuffled 80/20 splits, built below) as it produces a less biased performance estimate than a single hold-out split. Every observation participates in the train/test partitions across iterations, which is important given the initial class imbalance in our data.
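The repeated-split scheme built later in this notebook can be sketched on toy data (a minimal sketch, assuming scikit-learn's `ShuffleSplit`):

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit

X_toy = np.arange(20).reshape(10, 2)  # 10 made-up observations

# each iteration re-draws a fresh 80/20 train/test partition of all rows
ss = ShuffleSplit(n_splits=3, test_size=0.2, random_state=10)
sizes = [(len(train), len(test)) for train, test in ss.split(X_toy)]
print(sizes)  # three (train, test) size pairs
```

With 10 rows and `test_size=0.2`, every iteration yields 8 training and 2 test observations, so the model is evaluated on several different partitions rather than one fixed split.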
Oversampling was also used to limit the bias that would result from our target attribute being heavily skewed toward clients that did not default.
Identifying default early will allow the credit card company to minimize its losses. If early default identification occurs, the company can reduce the borrower's credit limit or preemptively work with the borrower to create a new repayment plan. Both outcomes help the credit company reduce losses that would occur if no action were taken.
import pandas as pd
import numpy as np
import seaborn as sns
from matplotlib import pyplot as plt
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
DataMeaningType = pd.DataFrame(
    {'Attribute': ['LIMIT_BAL', 'Gender:', 'Education:', 'Marital status:', 'Age:', 'PAY_0 to PAY_6:', 'BILL_AMT1 to BILL_AMT6:', 'PAY_AMT1 to PAY_AMT6:', 'Default payment next month:'],
     'Data Type ': ['Nominal scale', 'Categorical', 'Ordinal scale', 'Categorical scale', 'Numerical/Nominal', 'Categorical scale', 'Nominal scale', 'Nominal scale', 'Categorical scale'],
     'Description': ['Combined total of credit (amount of money) given to the individual borrower and their family.',
                     '1 represents male and 2 represents female.',
                     '1 represents the highest level of education and 4 the lowest. 1 = graduate school, 2 = university, 3 = high school and 4 = others.',
                     '1 = married, 2 = single, 3 = others. Value 0 is undefined.',
                     'Measures how old a borrower is.',
                     'These attributes describe the past monthly payment status. For example, PAY_0 represents the payment status in September 2005 and PAY_6 the payment status in April 2005. -1 = pay duly; 1 = payment delay for one month; 2 = payment delay for two months; . . .; 8 = payment delay for eight months; 9 = payment delay for nine months and above.',
                     'Amount of the bill statement in each respective month.',
                     'This value represents the amount of the credit card bill paid in each respective month.',
                     '1 represents a default or missed payment. 0 represents payment made.']})
pd.set_option("max_colwidth", 3000)
DataMeaningType
A more detailed explanation of the attributes can be found at: https://archive.ics.uci.edu/ml/datasets/default+of+credit+card+clients
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
df = pd.read_csv('C:/Users/bodie/Documents/credit_card.csv')
#Deleting ID as useless variable
del df['ID']
#Dropping education variables 0,5,6 & Marriage status 0
#since we don't know what these are per UCI page and these are such a small portion of data less than 3%
df_new = df[(df.EDUCATION !=0)&(df.EDUCATION !=5) &
(df.EDUCATION !=6) & (df.MARRIAGE!= 0)]
#Creating a backup copy for reference.
df_base = df_new.copy()
df_base_education = pd.DataFrame(df_new,columns = ['LIMIT_BAL','SEX','EDUCATION','MARRIAGE','AGE','PAY_0',
'PAY_2','PAY_3','PAY_4','PAY_5','PAY_6','BILL_AMT1','BILL_AMT2',
'BILL_AMT3','BILL_AMT4','BILL_AMT5','BILL_AMT6','PAY_AMT1','PAY_AMT2',
'PAY_AMT3','PAY_AMT4','PAY_AMT5','PAY_AMT6','default_payment_next_month'])
The data was fairly clean to start when reviewing it. However, we did notice some values that were not in the defined range of the data. This was apparent in the categorical columns Education and Marriage. Education had 3 additional values of 0, 5, 6, which occurred 345 times out of the 30,000 values in this column. Marriage had a value of 0, which occurred 54 times out of the 30,000 values. We addressed both of these ambiguous values by dropping the affected records, as shown in the code above. The reasoning was that these undefined values make up such a small share of the data (less than 3% combined) that removing them has little effect on the overall data set.
import pandas as pd
import missingno as msno
msno.bar(df)
# First look at the Data Set
df.head()
df.isnull().sum()
Based on the above chart there are no missing values for any attribute in the data set.
sns.set(font_scale=0.8)
corrMatrix = df.corr()
plt.figure(figsize=(16, 10))
sns.heatmap(df.corr(), annot=True)
Utilizing a correlation plot, we first identified attributes that showed strong relationships to default. The relationship was strongest for the history-of-past-payment attributes, which record payment delays before defaulting; payment delay data in the earlier time frames showed the highest correlation to default. A graph of the correlation matrix is seen above. Reviewing it, we identified that only PAY_0 through PAY_6 had a high correlation to the variable of interest, default payment next month. This finding is consistent with our analysis of prepayments: if a borrower prepays, which is reflected in PAY_0 to PAY_6, the borrower is not defaulting, while a deferred payment increases the likelihood that the borrower defaults.
# pairwise plots of features
df_sub_PAY = df[["default_payment_next_month","PAY_0","PAY_2","PAY_3","PAY_4","PAY_5","PAY_6"]] #splitting attributes based on high correlation
sns.pairplot(df_sub_PAY, hue="default_payment_next_month", height=2)
plt.rcParams.update({'font.size': 15})
plt.rcParams["figure.figsize"] = (20,20)
The scatterplots above show the relationships among the PAY_0 through PAY_6 variables. If a borrower prepays or defers in one month, the borrower is likely to repeat that payment behavior in the next month. This is visually apparent in the plots: in PAY_0 vs PAY_2, borrowers who prepaid (negative values in both months) largely did not default, while borrowers with positive values in consecutive months, indicating repeated deferrals, have a higher chance of defaulting.
# Violin Plot of Pay 6 and Pay 5
sns.set(font_scale=1)
f, ax = plt.subplots(figsize=(15, 10))
sns.violinplot(x="PAY_6", y="PAY_5", hue="default_payment_next_month", data=df,
split=True, inner="quart",palette='rocket')
Examining the differences in payment history was best viewed in a violin plot comparison. The plot above clearly indicates an increased likelihood of default for increased payment delays made near the end of the 6 month period. This makes sense logically as delaying payments after having made several already would indicate struggling to make payments and therefore also likely to default.
f, ax = plt.subplots(figsize=(15, 10))
sns.violinplot(x="PAY_0", y="PAY_2", hue="default_payment_next_month", data=df,
split=True, inner="quart",palette='rocket')
Similarly to the previous plot, this plot shows that a payment delay in the first or second repayment period will likely lead to a client defaulting, as would be expected.
Task: Payment Default
Based on the results from Lab 2, the Random Forest model outperformed other models for predicting payment default and will therefore be used as our base model to measure improvements using clustering in the following 4 methods.
For each of the methods below we used a silhouette score to measure effectiveness, as explained in Business Understanding. This value determined the best clustering parameters for each respective method. We then ran those parameters through our base Random Forest model to look for improvements in the accuracy metrics.
%%time
from sklearn.model_selection import ShuffleSplit
from sklearn.model_selection import StratifiedKFold
import random
#Setting Seed
#will use this in CV portion
random.seed(10)
seed = random.randint(1,500)
print("seed is:",seed)
#Creating the Task 1 CV
num_cv_iterations = 10
cv_object = ShuffleSplit(n_splits=num_cv_iterations,
random_state = seed,
test_size = 0.2)
print(cv_object)
In the above section, we create the CV object that we will use to test the metrics of the model. The seed is fixed (printed above as 293) so that comparisons across runs are consistent. We used an 80/20 split for this task.
%%time
#Code utilized from https://machinelearningmastery.com/smote-oversampling-for-imbalanced-classification/
from imblearn.over_sampling import SMOTE
from collections import Counter
if 'default_payment_next_month' in df_new:
    y = df_new['default_payment_next_month'].values
    del df_new['default_payment_next_month']
X = df_new.values
#Saving out the column names , so we can make dataframes later on
col = ['LIMIT_BAL','SEX','EDUCATION','MARRIAGE','AGE','PAY_0',
'PAY_2','PAY_3','PAY_4','PAY_5','PAY_6','BILL_AMT1','BILL_AMT2',
'BILL_AMT3','BILL_AMT4','BILL_AMT5','BILL_AMT6','PAY_AMT1','PAY_AMT2',
'PAY_AMT3','PAY_AMT4','PAY_AMT5','PAY_AMT6']
os = SMOTE(random_state=seed)
#Our new datasets to use will be X_res,y_res
#X_res is the resample dataset that is now more balance
#Y_res is the target column that is now more balance.
X_res, y_res = os.fit_resample(X, y)
#Previous class distribtion
counter = Counter(y)
print("Previous class breakdown:",counter)
# summarize the new class distribution
counter_res = Counter(y_res)
print("OS_breakout",counter_res)
Our dataset was initially highly imbalanced toward non-defaulters, which would make classification difficult for our models. To overcome this shortfall, we oversampled the minority class using the SMOTE technique from the imblearn package. SMOTE works by creating synthetic minority-class examples (interpolated between existing minority points) until there are as many of them as majority-class examples. As seen above, there were initially only 6605 defaults compared to 22996 non-defaults; applying the technique raised the number of default examples to the same count. The models will now favor one class less over the other and potentially produce better prediction metrics such as accuracy, sensitivity, and specificity. The oversampled dataset used for the models is stored in the arrays X_res and y_res.
%%time
from sklearn.model_selection import train_test_split
#reorg the dataframe X_res then continue with the process
X_rs = pd.DataFrame(data=X_res,columns=col)
imp_col = ['LIMIT_BAL', 'SEX', 'EDUCATION', 'MARRIAGE', 'PAY_0', 'PAY_5',
'BILL_AMT1', 'BILL_AMT2', 'PAY_AMT1', 'PAY_AMT2']
X_rs = X_rs[imp_col]
y_rs = pd.DataFrame(data=y_res,columns=['default_payment_next_month'])
X_train, X_test, y_train, y_test = train_test_split(X_rs, y_rs, test_size = 0.20, random_state = seed)
X_rs = X_rs.values
y_rs = y_rs['default_payment_next_month'].values
%%time
from sklearn.model_selection import ShuffleSplit
import imblearn
import numpy as np
#Random Forest
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn import metrics as mt
from sklearn.pipeline import Pipeline
std_scl = StandardScaler()
rf_clf = RandomForestClassifier()
piped_object = Pipeline([('scale', std_scl),
('Random_Forest', rf_clf)])
Iteration = []
Accuracy = []
Sensitivity = []
Specificity = []
for iter_num, (train_indices, test_indices) in enumerate(cv_object.split(X_rs,y_rs)):
    piped_object.fit(X_rs[train_indices],y_rs[train_indices]) # train object
    y_hat = piped_object.predict(X_rs[test_indices]) # get test set predictions
    cm1 = mt.confusion_matrix(y_rs[test_indices],y_hat)
    Iteration.append(iter_num)
    Accuracy.append(mt.accuracy_score(y_rs[test_indices],y_hat))
    Sensitivity.append(cm1[0,0]/(cm1[0,0]+cm1[0,1]))
    Specificity.append(cm1[1,1]/(cm1[1,0]+cm1[1,1]))
rf_base_results = pd.DataFrame({'Iteration':Iteration,'Accuracy': Accuracy,'Sensitivity':Sensitivity,
'Specificity':Specificity},columns = ['Iteration','Accuracy','Sensitivity','Specificity'])
rf_base_results
The above table displays the metrics for our best model from Lab 2; each clustering method will be applied to it and compared to see what improvements can be obtained. Accuracy measures the model's ability to correctly identify whether a client will or will not default. Note the layout of our confusion matrix: the non-default class occupies the first row, so the column we report as Sensitivity is the fraction of non-defaulters correctly identified, while Specificity is the fraction of defaulters correctly identified.
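For reference, the formulas used in the CV loop above can be sketched on toy labels (hypothetical values, not our data):

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# toy labels: 0 = no default, 1 = default (made-up values)
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1])
y_pred = np.array([0, 0, 0, 1, 1, 1, 0, 1])

cm = confusion_matrix(y_true, y_pred)  # rows = true class, cols = predicted class
accuracy = np.trace(cm) / cm.sum()          # correct predictions over all predictions
rate_class0 = cm[0, 0] / cm[0].sum()        # what the tables report as Sensitivity
rate_class1 = cm[1, 1] / cm[1].sum()        # what the tables report as Specificity
print(accuracy, rate_class0, rate_class1)
```

With these toy labels, 3 of 4 clients in each class are classified correctly, so all three metrics come out to 0.75.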
%%time
#Utlized code from
#https://towardsdatascience.com/k-means-dbscan-gmm-agglomerative-clustering-mastering-the-popular-models-in-a-segmentation-c891a3818e29
#https://github.com/IDB-FOR-DATASCIENCE/Segmentation-Modelling/blob/main/Segmentation%20Notebook%20_Final.ipynb
from sklearn.metrics import silhouette_score
from sklearn.cluster import KMeans
from yellowbrick.cluster import KElbowVisualizer
centers = list(range(2,10))
scaled_X_train = std_scl.fit_transform(X_train)
model = KMeans(init='k-means++',random_state=seed)
visualizer = KElbowVisualizer(model, k=(2,10),metric='silhouette', timings= True, locate_elbow=False)
visualizer.fit(scaled_X_train) # Fit the data to the visualizer
visualizer.show() # Finalize and render the figure
df3 = pd.DataFrame(visualizer.k_values_,columns=['centers'])
df3['scores'] = visualizer.k_scores_
df4 = df3[df3.scores == df3.scores.max()]
print('Optimal number of clusters based on silhouette score:', df4['centers'].tolist())
In the visual above we see that for the KMeans algorithm 2 clusters was significantly better, with a silhouette score of approximately 0.326. K values were kept within a range of 2-9 as our computer lacked the computing power to go much beyond that. There was an immediate drop in score after 2 clusters with little to no improvement or change across the selected range, suggesting more clusters would do little to improve the silhouette score. This also fits with the elbow method, in which the chart resembles an arm-like shape and the best fit sits at the point of inflection on the curve (the "elbow").
%%time
from sklearn.metrics import silhouette_score
from sklearn.cluster import KMeans
y = y_rs
X = X_rs
std_scl = StandardScaler()
rf_clf = RandomForestClassifier(bootstrap=True, criterion='gini', max_depth=100,
                                max_features='sqrt', min_samples_leaf=1, min_samples_split=2, n_estimators=100)
piped_object = Pipeline([('scale', std_scl),
('Random_Forest', rf_clf)])
Iteration = []
Accuracy = []
Sensitivity = []
Specificity = []
for iter_num, (train_indices, test_indices) in enumerate(cv_object.split(X,y)):
    cls = KMeans(n_clusters=2, init='k-means++',random_state=seed)
    X1 = X[train_indices]
    cls.fit(X1)
    newfeature = cls.labels_
    X1 = np.column_stack((X[train_indices],pd.get_dummies(newfeature)))
    X2 = X[test_indices]
    cls.fit(X2)
    newfeature = cls.labels_
    X2 = np.column_stack((X[test_indices],pd.get_dummies(newfeature)))
    piped_object.fit(X1,y[train_indices])
    y_hat = piped_object.predict(X2)
    cm1 = mt.confusion_matrix(y[test_indices],y_hat)
    Iteration.append(iter_num)
    Accuracy.append(mt.accuracy_score(y[test_indices],y_hat))
    Sensitivity.append(cm1[0,0]/(cm1[0,0]+cm1[0,1]))
    Specificity.append(cm1[1,1]/(cm1[1,0]+cm1[1,1]))
rf_Kmeans_results = pd.DataFrame({'Iteration':Iteration,'Accuracy': Accuracy,'Sensitivity':Sensitivity,
'Specificity':Specificity},columns = ['Iteration','Accuracy','Sensitivity','Specificity'])
rf_Kmeans_results
In the above table we see the results of the KMeans algorithm applied to the Random Forest model with 2 clusters. This had little to no change to the accuracy, sensitivity or specificity indicating it did not improve the functionality of the model.
%%time
#Utlized code from
#https://towardsdatascience.com/k-means-dbscan-gmm-agglomerative-clustering-mastering-the-popular-models-in-a-segmentation-c891a3818e29
#https://github.com/IDB-FOR-DATASCIENCE/Segmentation-Modelling/blob/main/Segmentation%20Notebook%20_Final.ipynb
from sklearn.metrics import silhouette_score
from sklearn.cluster import DBSCAN
import matplotlib.pyplot as plt
def get_dbscan_score(data, center):
    '''
    INPUT:
    data - the dataset you want to fit DBSCAN to
    center - the min_samples value to use
    OUTPUT:
    score - the Silhouette Score for DBSCAN
    '''
    #instantiate DBSCAN
    dbscan = DBSCAN(eps= 1.9335816413107338, min_samples=center)
    # Then fit the model to your data using the fit method
    model = dbscan.fit(data)
    # Calculate Silhouette Score
    score = silhouette_score(data, model.labels_, metric='euclidean')
    return score
scores = []
for center in centers:
    scores.append(get_dbscan_score(scaled_X_train, center))
plt.plot(centers, scores, linestyle='--', marker='o', color='b');
plt.xlabel('min_samples');
plt.ylabel('Silhouette Score');
plt.title('Silhouette Score vs. min_samples');
df3 = pd.DataFrame(centers,columns=['min_samples'])
df3['scores'] = scores
df4 = df3[df3.scores == df3.scores.max()]
print('Optimal number of min_samples based on silhouette score:', df4['min_samples'].tolist())
In the visual above we see that for the DBSCAN algorithm, a min_samples value of 6 had the best silhouette score at approximately 0.138. The min_samples values were again kept within a range of 2-9 due to computing power. We see a drop in score at 9, and overall the silhouette scores were relatively low across the range, indicating high amounts of overlap in each case.
%%time
from sklearn.cluster import DBSCAN
X1 = X_rs
cls = DBSCAN(eps=1.9335816413107338, min_samples=6)
cls.fit(X1)
newfeature = cls.labels_
y = y_rs
X = X_rs
X = np.column_stack((X,pd.get_dummies(newfeature)))
std_scl = StandardScaler()
rf_clf = RandomForestClassifier(bootstrap=True, criterion='gini', max_depth=100,
                                max_features='sqrt', min_samples_leaf=1, min_samples_split=2, n_estimators=100)
piped_object = Pipeline([('scale', std_scl),
('Random_Forest', rf_clf)])
Iteration = []
Accuracy = []
Sensitivity = []
Specificity = []
for iter_num, (train_indices, test_indices) in enumerate(cv_object.split(X,y)):
    piped_object.fit(X[train_indices],y[train_indices]) # train object
    y_hat = piped_object.predict(X[test_indices]) # get test set predictions
    cm1 = mt.confusion_matrix(y[test_indices],y_hat)
    Iteration.append(iter_num)
    Accuracy.append(mt.accuracy_score(y[test_indices],y_hat))
    Sensitivity.append(cm1[0,0]/(cm1[0,0]+cm1[0,1]))
    Specificity.append(cm1[1,1]/(cm1[1,0]+cm1[1,1]))
rf_DB_results = pd.DataFrame({'Iteration':Iteration,'Accuracy': Accuracy,'Sensitivity':Sensitivity,
'Specificity':Specificity},columns = ['Iteration','Accuracy','Sensitivity','Specificity'])
rf_DB_results
In the above table we see the results of the DBSCAN algorithm (min_samples = 6, as selected above) applied to the Random Forest model. This had little to no change to the accuracy, sensitivity or specificity, indicating it did not improve the functionality of the model.
%%time
#Utlized code from
#https://towardsdatascience.com/k-means-dbscan-gmm-agglomerative-clustering-mastering-the-popular-models-in-a-segmentation-c891a3818e29
#https://github.com/IDB-FOR-DATASCIENCE/Segmentation-Modelling/blob/main/Segmentation%20Notebook%20_Final.ipynb
from sklearn.cluster import AgglomerativeClustering
from yellowbrick.cluster import KElbowVisualizer
centers = list(range(2,10))
model = AgglomerativeClustering()
# k is range of number of clusters.
visualizer = KElbowVisualizer(model, k=(2,10),metric='silhouette', timings= True, locate_elbow=False)
visualizer.fit(scaled_X_train) # Fit the data to the visualizer
visualizer.show() # Finalize and render the figure
df3 = pd.DataFrame(visualizer.k_values_,columns=['centers'])
df3['scores'] = visualizer.k_scores_
df4 = df3[df3.scores == df3.scores.max()]
print('Optimal number of clusters based on silhouette score:', df4['centers'].tolist())
In the visual above we see that for the Agglomerative algorithm 2 clusters was significantly better with a silhouette score at approximately 0.243. There was an immediate drop in score after 2 clusters with little to no improvement or change for the selected range. This was seen previously using the elbow method indicating this as an optimal value.
%%time
from sklearn.cluster import AgglomerativeClustering
std_scl = StandardScaler()
rf_clf = RandomForestClassifier(bootstrap=True, criterion='gini', max_depth=100,
                                max_features='sqrt', min_samples_leaf=1, min_samples_split=2, n_estimators=100)
piped_object = Pipeline([('scale', std_scl),
('Random_Forest', rf_clf)])
Iteration = []
Accuracy = []
Sensitivity = []
Specificity = []
for iter_num, (train_indices, test_indices) in enumerate(cv_object.split(X,y)):
    cls = AgglomerativeClustering(n_clusters=2, linkage='ward')
    X1 = X[train_indices]
    cls.fit(X1)
    newfeature = cls.labels_
    X1 = np.column_stack((X[train_indices],pd.get_dummies(newfeature)))
    X2 = X[test_indices]
    cls.fit(X2)
    newfeature = cls.labels_
    X2 = np.column_stack((X[test_indices],pd.get_dummies(newfeature)))
    piped_object.fit(X1,y[train_indices])
    y_hat = piped_object.predict(X2)
    cm1 = mt.confusion_matrix(y[test_indices],y_hat)
    Iteration.append(iter_num)
    Accuracy.append(mt.accuracy_score(y[test_indices],y_hat))
    Sensitivity.append(cm1[0,0]/(cm1[0,0]+cm1[0,1]))
    Specificity.append(cm1[1,1]/(cm1[1,0]+cm1[1,1]))
rf_AGG_results = pd.DataFrame({'Iteration':Iteration,'Accuracy': Accuracy,'Sensitivity':Sensitivity,
'Specificity':Specificity},columns = ['Iteration','Accuracy','Sensitivity','Specificity'])
rf_AGG_results
In the above table we see the results of the Agglomerative algorithm applied to the Random Forest model with 2 clusters. This had little to no change to the accuracy, sensitivity or specificity indicating it did not improve the functionality of the model.
%%time
#Utlized code from
#https://towardsdatascience.com/k-means-dbscan-gmm-agglomerative-clustering-mastering-the-popular-models-in-a-segmentation-c891a3818e29
#https://github.com/IDB-FOR-DATASCIENCE/Segmentation-Modelling/blob/main/Segmentation%20Notebook%20_Final.ipynb
from sklearn.mixture import GaussianMixture
n_components = range(2, 10)
covariance_type = ['spherical', 'tied', 'diag', 'full']
score=[]
for cov in covariance_type:
    for n_comp in n_components:
        gmm = GaussianMixture(n_components=n_comp,covariance_type=cov,random_state = seed)
        model = gmm.fit(scaled_X_train)
        model_2 = model.predict(scaled_X_train)
        score_s = silhouette_score(scaled_X_train, model_2, metric='euclidean')
        score.append((cov,n_comp,score_s))
score_1 = pd.DataFrame(score)
score_1.columns = ['Covariance_Type', 'N_Components','Silhouette_Score']
score_2 = score_1[score_1.Silhouette_Score == score_1.Silhouette_Score.max()]
score_2.head(n=2)
For the Gaussian Mixture algorithm we first determined the best covariance type based on the silhouette score generated which as seen above was spherical type at 2 clusters with a silhouette score of 0.438.
%%time
# Gassian Mixture Model covariance type: spherical
from sklearn.mixture import GaussianMixture
from sklearn.metrics import silhouette_score
scores = []
K = range(2,10)
for k in K:
    g_co = GaussianMixture(n_components=k, covariance_type='spherical')
    g_fit = g_co.fit(scaled_X_train)
    g_pred = g_fit.predict(scaled_X_train)
    sc = silhouette_score(scaled_X_train, g_pred)
    scores.append(sc)
print("Highest Score:", max(scores), "at K =", scores.index(max(scores))+2) #starting at k=2
sns.set_context("talk")
plt.plot(K, scores, "-g")
plt.xlabel('K (Number of Centroids)')
plt.ylabel('Silhouette Score')
plt.title('Gaussian Mixture Model: spherical Covariance')
plt.show()
In the visual above we see that for the Gaussian Mixture algorithm, 2 clusters were again identified as significantly better, with a silhouette score of approximately 0.438. There was a similar immediate drop in score after 2 clusters with little to no improvement or change across the selected range, consistent with the elbow method's indication that this is an optimal value.
%%time
from sklearn.mixture import GaussianMixture
X1 = X_rs
cls = GaussianMixture(n_components=2, covariance_type='spherical')
newfeature = cls.fit_predict(X1)
y = y_rs
X = X_rs
X = np.column_stack((X,pd.get_dummies(newfeature)))
std_scl = StandardScaler()
rf_clf = RandomForestClassifier(bootstrap=True, criterion='gini', max_depth=100,
                                max_features='sqrt', min_samples_leaf=1, min_samples_split=2, n_estimators=100)
piped_object = Pipeline([('scale', std_scl),
('Random_Forest', rf_clf)])
Iteration = []
Accuracy = []
Sensitivity = []
Specificity = []
for iter_num, (train_indices, test_indices) in enumerate(cv_object.split(X,y)):
    piped_object.fit(X[train_indices],y[train_indices]) # train object
    y_hat = piped_object.predict(X[test_indices]) # get test set predictions
    cm1 = mt.confusion_matrix(y[test_indices],y_hat)
    Iteration.append(iter_num)
    Accuracy.append(mt.accuracy_score(y[test_indices],y_hat))
    Sensitivity.append(cm1[0,0]/(cm1[0,0]+cm1[0,1]))
    Specificity.append(cm1[1,1]/(cm1[1,0]+cm1[1,1]))
rf_GMM_results = pd.DataFrame({'Iteration':Iteration,'Accuracy': Accuracy,'Sensitivity':Sensitivity,
'Specificity':Specificity},columns = ['Iteration','Accuracy','Sensitivity','Specificity'])
rf_GMM_results
In the above table we see the results of the Gaussian Mixture algorithm applied to the Random Forest model with 2 clusters. This had little to no change to the accuracy, sensitivity or specificity, indicating it did not improve the functionality of the model.
Model_compare = pd.DataFrame(
{'Model': ['KMeans','DBSCAN','Agglomerative','Gaussian Mixture','Random Forest (No Clustering)'],
'Accuracy': ['0.8122296','0.8125882','0.8119688','0.8116751','0.8117839'],
'Sensitivity': ['0.8267386','0.8281483','0.8266373','0.8259528','0.8270405'],
'Specificity': ['0.7977643','0.7970641','0.7973199','0.7974332','0.7965588'],
'Silhouette Score': ['0.326','0.138','0.243','0.438',''],
'N_Components': ['2','8','2','2',''],
'SC Tuning Time': ['1m55s','6m52s','10m19s','8m8s',''],
'CV Runtime': ['43.9s','43.7s','6m16s','1m52s',''],
})
pd.set_option("max_colwidth", 3000)
Model_compare
As seen in the results above, the best silhouette score was determined for each type of clustering, with the Gaussian Mixture showing the best result at 0.438 (where 1.0 would be an optimal score). Each clustering method identified only 2 clusters within the data, with the exception of DBSCAN (min_samples = 8), though this had the lowest silhouette score comparatively. In terms of runtimes, KMeans had the best result at approximately 1-2 minutes, whereas the Gaussian Mixture method with the highest score took about 4 times longer.
In terms of the Accuracy, Sensitivity and Specificity there is no indication of significant differences between using the clustering methods and the base Random Forest model without any clustering.
%%time
#Code utlized from https://hdbscan.readthedocs.io/en/latest/comparing_clustering_algorithms.html
#Had to delete the time part as it was causing script to error
import seaborn as sns
plot_kwds = {'alpha': 0.25, 's': 80, 'linewidths': 0}  # scatter styling from the source notebook
def plot_clusters(data, algorithm, args, kwds):
    labels = algorithm(*args, **kwds).fit_predict(data)
    palette = sns.color_palette('deep', np.unique(labels).max() + 1)
    colors = [palette[x] if x >= 0 else (0.0, 0.0, 0.0) for x in labels]
    plt.scatter(data.T[0], data.T[1], c=colors, **plot_kwds)
    frame = plt.gca()
    frame.axes.get_xaxis().set_visible(False)
    frame.axes.get_yaxis().set_visible(False)
    plt.title('Clusters found by {}'.format(str(algorithm.__name__)), fontsize=18)
%%time
from sklearn import cluster
data = scaled_X_train  # plotting on the scaled training features (first two columns shown)
size = (12,8)
plt.subplots(2,2,figsize=size)
plt.subplot(2,2,1)
plot_clusters(data, cluster.KMeans, (), {'n_clusters':2})
plt.subplot(2,2,2)
plot_clusters(data, cluster.AgglomerativeClustering, (), {'n_clusters':2, 'linkage':'ward'})
plt.subplot(2,2,3)
plot_clusters(data, DBSCAN, (), {'eps':1.9335816413107338, 'min_samples':8})
plt.subplot(2,2,4)
plot_clusters(data, GaussianMixture, (), {'n_components':2, 'covariance_type':'spherical'})
In the above visualization we see the clustering groups that were identified for each method. In each of them, no clearly distinct cluster groups can be seen as the data mostly overlaps. This result explains the low Silhouette Scores and lack of clusters identified by most methods. Ultimately this indicates the data is too similar to find any clear distinctions and would do little to improve the accuracy metrics for the base Random Forest model.
%%time
from sklearn.metrics import roc_curve
from sklearn.mixture import GaussianMixture
from sklearn.metrics import silhouette_score
from sklearn.cluster import KMeans
from yellowbrick.cluster import KElbowVisualizer
from sklearn.cluster import DBSCAN
from sklearn.cluster import AgglomerativeClustering
X_train, X_test, y_train, Y_test = train_test_split(X_rs, y_rs, test_size = 0.20, random_state = seed)
std_scl.fit(X_train)
X_train = std_scl.transform(X_train) # apply to training
X_test = std_scl.transform(X_test)
##RF
Random_Forest = RandomForestClassifier(bootstrap=True, criterion='gini', max_depth=100,
                                       max_features='sqrt', min_samples_leaf=1, min_samples_split=2, n_estimators=100)
##GMM
G1 = X_train
GM = GaussianMixture(n_components=2, covariance_type='spherical')
G1feature = GM.fit_predict(G1)
G1 = np.column_stack((X_train,pd.get_dummies(G1feature)))
G2 = X_test
G2feature = GM.fit_predict(G2)
G2 = np.column_stack((X_test,pd.get_dummies(G2feature)))
GMM_RF = RandomForestClassifier(bootstrap=True, criterion='gini', max_depth=100,
                                max_features='sqrt', min_samples_leaf=1, min_samples_split=2, n_estimators=100)
###Kmeans
KM = KMeans(n_clusters=2, init='k-means++',random_state=seed)
K1 = X_train
K1fit = KM.fit(K1)
K1feature = K1fit.labels_
K1 = np.column_stack((X_train,pd.get_dummies(K1feature)))
K2 = X_test
K2fit = KM.fit(K2)
K2feature = K2fit.labels_
K2 = np.column_stack((X_test,pd.get_dummies(K2feature)))
KM_RF = RandomForestClassifier(bootstrap=True, criterion='gini', max_depth=100,
                               max_features='sqrt', min_samples_leaf=1, min_samples_split=2, n_estimators=100)
###AGG
Ag = AgglomerativeClustering(n_clusters=2, linkage='ward')
A1 = X_train
A1fit = Ag.fit(A1)
A1feature = A1fit.labels_
A1 = np.column_stack((X_train,pd.get_dummies(A1feature)))
A2 = X_test
A2fit = Ag.fit(A2)
A2feature = A2fit.labels_
A2 = np.column_stack((X_test,pd.get_dummies(A2feature)))
Ag_RF = RandomForestClassifier(bootstrap=True, criterion='gini', max_depth=100,
                               max_features='sqrt', min_samples_leaf=1, min_samples_split=2, n_estimators=100)
##DBscan
DB = DBSCAN(eps=1.9335816413107338, min_samples=8)
D1 = X_train
D1feature = DB.fit_predict(D1)
D1 = np.column_stack((X_train,pd.get_dummies(D1feature)))
D2 = X_test
D2feature = DB.fit_predict(D2)
D2 = np.column_stack((X_test,pd.get_dummies(D2feature)))
DB_RF = RandomForestClassifier(bootstrap= 'bool', criterion= 'gini', max_depth= 100,
max_features='sqrt', min_samples_leaf= 1, min_samples_split= 2, n_estimators = 100)
#####
Random_Forest.fit(X_train,y_train)
GMM_RF.fit(G1,y_train)
KM_RF.fit(K1,y_train)
Ag_RF.fit(A1,y_train)
DB_RF.fit(D1,y_train)
y_pred_prob1 = Random_Forest.predict_proba(X_test)[:,1]
fpr1 , tpr1, thresholds1 = roc_curve(Y_test, y_pred_prob1)
y_pred_prob2 = GMM_RF.predict_proba(G2)[:,1]
fpr2 , tpr2, thresholds2 = roc_curve(Y_test, y_pred_prob2)
y_pred_prob3 = KM_RF.predict_proba(K2)[:,1]
fpr3 , tpr3, thresholds2 = roc_curve(Y_test, y_pred_prob3)
y_pred_prob4 = Ag_RF.predict_proba(A2)[:,1]
fpr4 , tpr4, thresholds2 = roc_curve(Y_test, y_pred_prob4)
y_pred_prob5 = DB_RF.predict_proba(D2)[:,1]
fpr5 , tpr5, thresholds2 = roc_curve(Y_test, y_pred_prob5)
plt.plot([0,1],[0,1], 'k--')
plt.plot(fpr1, tpr1, label= "RF")
plt.plot(fpr2, tpr2, label= "GMM")
plt.plot(fpr3, tpr3, label= "Kmeans")
plt.plot(fpr4, tpr4, label= "Agglomerative")
plt.plot(fpr5, tpr5, label= "DB")
plt.legend(loc="lower right")
plt.xlabel("FPR")
plt.ylabel("TPR")
plt.title('ROC')
plt.show()
For the KMeans clustering method there was very little impact on the base Random Forest model. Multiple diagnostics showed a high amount of overlap between the two clusters that were selected as optimal. When the cluster labels were added to the Random Forest model, average accuracy improved by only 0.0004%, a change that is almost certainly not statistically significant. Sensitivity decreased by about 0.001% while specificity increased by a similar amount; both changes are likewise insignificant.
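The accuracy comparison described above (base model versus base model plus a one-hot cluster feature) can be sketched as follows. This is a minimal illustration on synthetic data, not our exact run; all dataset and parameter choices here are stand-ins.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the scaled credit data
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)

# Cross-validated accuracy of the base model
rf = RandomForestClassifier(n_estimators=100, random_state=0)
base_acc = cross_val_score(rf, X, y, cv=5, scoring='accuracy').mean()

# Append the KMeans cluster assignment as extra one-hot columns
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
X_aug = np.column_stack((X, labels == 0, labels == 1))
aug_acc = cross_val_score(rf, X_aug, y, cv=5, scoring='accuracy').mean()

print(f"base: {base_acc:.4f}  with cluster feature: {aug_acc:.4f}")
```

When the clusters overlap heavily, the two scores tend to differ only marginally, matching what we observed.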
The Agglomerative and Gaussian Mixture results were nearly identical to those of the KMeans method, performing slightly worse on all metrics. In terms of silhouette score the Gaussian Mixture was by far the best, and it also displayed slightly more distinct clusters visually. However, as with the other methods, this ultimately had little to no impact on the base model.
DBSCAN was unique in that it identified more clusters than the other methods; however, its silhouette scores were relatively low for every cluster count it produced compared to the other methods. The visual representation also showed that all identified clusters overlapped heavily. Adding these clusters to the base model likewise had little to no impact.
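DBSCAN's cluster count can be read directly off its labels, where -1 marks noise points. A small sketch on synthetic blobs (the eps and min_samples values in our actual run were tuned to the credit data; these values are illustrative):

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import DBSCAN

# Three well-separated synthetic blobs stand in for the credit data
X, _ = make_blobs(n_samples=750, centers=[[1, 1], [-1, -1], [1, -1]],
                  cluster_std=0.4, random_state=0)
X = StandardScaler().fit_transform(X)

labels = DBSCAN(eps=0.3, min_samples=8).fit_predict(X)

# -1 is DBSCAN's noise label, so it is excluded from the cluster count
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = int(np.sum(labels == -1))
print(f"clusters found: {n_clusters}, noise points: {n_noise}")
```

Unlike KMeans, the number of clusters is an output rather than an input, which is why DBSCAN reported a different cluster count than the other methods.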
Ultimately, based on the evidence from several diagnostics, we can conclude that all clustering methods struggled to identify clearly distinct clusters because the observations are too similar to one another. This in turn meant the cluster features could do little to improve the accuracy metrics beyond what our base model was already capable of.
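The silhouette comparison behind this conclusion can be reproduced with scikit-learn alone. A minimal sketch on synthetic, deliberately overlapping data (the dataset and parameters here are illustrative assumptions, not our exact run):

```python
from sklearn.datasets import make_classification
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.mixture import GaussianMixture
from sklearn.metrics import silhouette_score

# Overlapping synthetic data loosely mimicking the scaled credit features
X, _ = make_classification(n_samples=1000, n_features=10, n_informative=4,
                           class_sep=0.5, random_state=0)
X = StandardScaler().fit_transform(X)

labels = {
    'KMeans': KMeans(n_clusters=2, init='k-means++', n_init=10, random_state=0).fit_predict(X),
    'GMM': GaussianMixture(n_components=2, random_state=0).fit_predict(X),
    'Agglomerative': AgglomerativeClustering(n_clusters=2).fit_predict(X),
}
scores = {name: silhouette_score(X, lab) for name, lab in labels.items()}
for name, s in scores.items():
    # scores near 1 mean distinct clusters; scores near 0 mean heavy overlap
    print(f"{name}: {s:.3f}")
```

On data like this, all three methods tend to score well below 1, which mirrors the overlap we saw in the credit data.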
In terms of ramifications, our initial accuracy was already quite reasonable at 81%, and given the subject matter, some incorrect predictions about clients defaulting will have little to no negative impact compared to not using a predictive model at all. To further increase the effectiveness of our model, we would likely need to acquire more data that includes additional attribute information.
The significance of identifying default early is that it allows the credit card company to minimize some of its losses. If early default identification occurs, the company can reduce the borrower’s credit limit or preemptively work with the borrower to create a new repayment plan. Both outcomes help the credit company reduce the losses that would occur if no action were taken.
Did you achieve your goals? If not, can you rein in the utility of your modeling?
How useful is your model for interested parties (i.e., the companies or organizations that might want to use it for prediction)?
How would you measure the model's value if it was used by these parties?
How would you deploy your model for interested parties?
What other data should be collected?
How often would the model need to be updated, etc.?
We used Yellowbrick to visualize the silhouette scores for each respective clustering method.
To gain additional insight we created a similarity matrix, which provides a quick look at how similar each client in the data is to every other client. It is computed from pairwise distances taken across all feature dimensions for each client. A small distance between two clients indicates they are very similar, whereas large distances are typical of outliers, clients with little similarity to most others. The distances are then rescaled into similarity scores, with 1 representing nearly identical clients and 0 indicating the greatest difference.
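The distance-to-similarity rescaling described above is a simple min-max transform. A toy sketch with three hypothetical clients:

```python
import numpy as np
from sklearn.metrics import pairwise

# Three toy 'clients': the first two are close together, the third is far away
clients = np.array([[1.0, 2.0], [1.1, 2.1], [9.0, 9.0]])

R = pairwise.euclidean_distances(clients)        # pairwise distances
R = 1 - (R - R.min()) / (R.max() - R.min())      # rescale: 1 = identical, 0 = most different

print(np.round(R, 2))
```

The diagonal is always 1 (each client is identical to itself), the two nearby clients score close to 1, and the most distant pair scores exactly 0.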
Running the entire dataset initially exhausted system memory, as there was not enough space to allocate the full pairwise matrix. To circumvent this issue we used increasingly large samples and examined the results, shown below for 500, 5000, and 10000 clients out of a total of 22996.
%%time
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.metrics import pairwise

df_sample_500 = df_base.head(500)
df_sample_5000 = df_base.head(5000)
df_sample_10k = df_base.head(10000)

def plot_similarity_matrix(df_sample, n_clusters=2):
    """Cluster the sample, sort rows by cluster label, and plot the similarity matrix."""
    model = KMeans(init='k-means++', n_clusters=n_clusters, n_init=1).fit(df_sample)
    y = model.labels_
    idx_sorted = np.argsort(y, kind="quicksort")   # get the ordering of y
    data_sorted = df_sample.values[idx_sorted]     # sort the dataset by cluster so boundaries line up

    R = pairwise.euclidean_distances(data_sorted)  # pairwise distances between clients
    # transform distance to similarity: 1 = identical, 0 = most different
    min_r, max_r = np.min(R), np.max(R)
    R = 1 - (R - min_r) / (max_r - min_r)

    plt.figure(figsize=(20, 10))
    plt.pcolormesh(R)
    plt.colorbar()
    # plot class boundaries
    bounds = np.cumsum([np.sum(y == val) for val in np.unique(y)])
    for b in bounds:
        plt.plot([b, b], [0, len(y)], 'k', linewidth=4)
        plt.plot([0, len(y)], [b, b], 'k', linewidth=4)
    plt.show()

for sample in (df_sample_500, df_sample_5000, df_sample_10k):
    plot_similarity_matrix(sample)
From the above charts we can see a distinct difference between the 500-sample matrix and the others. Comparing the 5000- and 10000-sample matrices shows few differences, which supports the expectation that using all 22996 data points would look much the same.
Overall the similarity scores are mostly at or near 1 for all clients, indicating that they are very similar to each other, with some exceptions where scores drop to about 0.6 for certain clients. There is little indication of cluster structure here, as the pattern remains mostly consistent throughout the data. This is consistent with the small numbers of distinct clusters and the low silhouette scores identified by most methods.